Following the tutorial at:
In [24]:
import pandas as pd
In [25]:
# There are two data structures in pandas, Series and DataFrames
city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199])
In [26]:
pd.DataFrame({"City Name": city_names, "Population": population})
Out[26]:
In [27]:
# importing an existing csv file into DataFrame
california_housing_dataframe = pd.read_csv(
"https://storage.googleapis.com/mledu-datasets/california_housing_train.csv",
sep=","
)
In [28]:
california_housing_dataframe.shape
Out[28]:
In [29]:
california_housing_dataframe.head()
Out[29]:
In [30]:
california_housing_dataframe.hist('housing_median_age')
Out[30]:
In [31]:
cities = pd.DataFrame({'City Name': city_names, 'Population': population})
print(type(cities['City Name']))
cities['City Name']
Out[31]:
In [32]:
print(type(cities["City Name"][1]))
cities["City Name"][1]
Out[32]:
In [33]:
print(type(cities[0:2]))
cities[0:2]
Out[33]:
In [36]:
population / 1000
Out[36]:
In [37]:
import numpy as np
np.log(population)
Out[37]:
In [40]:
cities['Area square miles'] = pd.Series([46.87, 176.53, 97.92])
cities['Population density'] = cities['Population'] / cities['Area square miles']
cities
Out[40]:
In [39]:
population.apply(lambda val: val > 1000000)
Out[39]:
Modify the cities table by adding a new boolean column that is True if and only if both of the following are True: the city is named after a saint, and the city has an area greater than 50 square miles.
Note: Boolean Series are combined using the bitwise, rather than the traditional boolean, operators. For example, when performing logical and, use & instead of and.
Hint: "San" in Spanish means "saint."
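The note on combining Boolean Series can be illustrated with a minimal sketch. The temps and humid Series are hypothetical data invented for this illustration, not part of the tutorial. Each comparison is wrapped in parentheses because & and | bind more tightly than > and <.

```python
import pandas as pd

# Hypothetical example data, not taken from the tutorial.
temps = pd.Series([12, 25, 31])
humid = pd.Series([60, 40, 85])

# Element-wise logical operations on Boolean Series.
warm_and_dry = (temps > 20) & (humid < 70)   # logical and
warm_or_dry = (temps > 20) | (humid < 70)    # logical or
not_warm = ~(temps > 20)                     # logical not

print(warm_and_dry.tolist())  # [False, True, False]
print(warm_or_dry.tolist())   # [True, True, True]
print(not_warm.tolist())      # [True, False, False]

# The Python keyword `and` asks for a single truth value of a whole Series,
# which pandas refuses to guess, so it raises ValueError.
try:
    (temps > 20) and (humid < 70)
    keyword_and_failed = False
except ValueError:
    keyword_and_failed = True

print(keyword_and_failed)  # True
```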
In [46]:
cities['is saint and wide'] = (cities['Area square miles'] > 50) & (cities['City Name'].apply(lambda name: name.startswith("San")))
cities
Out[46]:
Both Series and DataFrame objects also define an index property that assigns an identifier value to each Series item or DataFrame row.
By default, at construction, pandas assigns index values that reflect the ordering of the source data. Once created, the index values are stable; that is, they do not change when data is reordered.
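One way to see this stability is to sort a DataFrame and look at the index afterwards. The sketch below rebuilds the small cities table from earlier under the assumed name df and sorts it with sort_values, which the tutorial itself does not use. The rows move, but each row keeps its original label.

```python
import pandas as pd

names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
pops = pd.Series([852469, 1015785, 485199])
df = pd.DataFrame({'City Name': names, 'Population': pops})

# The default index reflects the order of the source data.
print(list(df.index))  # [0, 1, 2]

# Sorting reorders the rows; the labels travel with their rows.
by_pop = df.sort_values('Population')
print(list(by_pop.index))  # [2, 0, 1]

# Label 1 still identifies San Jose, even though it is now the last row.
print(by_pop.loc[1, 'City Name'])  # San Jose
```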
In [47]:
city_names.index
Out[47]:
In [48]:
cities.index
Out[48]:
In [50]:
cities.reindex([2, 0, 1])
Out[50]:
Reindexing is a great way to shuffle (randomize) a DataFrame. In the example below, we take the index, which is array-like, and pass it to NumPy's random.permutation function, which returns a shuffled copy of its values and leaves the original index unchanged. Calling reindex with this shuffled array causes the DataFrame rows to be shuffled in the same way.
In [52]:
cities.reindex(np.random.permutation(cities.index))
Out[52]:
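As a quick check on the shuffle above, the following sketch confirms two things: np.random.permutation hands back a new array rather than modifying the index, and reindex keeps every label attached to its original row. The names df, shuffled_labels, and shuffled are chosen for this illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'City Name': ['San Francisco', 'San Jose', 'Sacramento'],
    'Population': [852469, 1015785, 485199],
})

# permutation returns a new, randomly ordered array of the labels.
shuffled_labels = np.random.permutation(df.index)

# The DataFrame's own index is untouched.
print(list(df.index))  # [0, 1, 2]

# reindex builds a new DataFrame whose rows follow the shuffled labels.
shuffled = df.reindex(shuffled_labels)

# Same set of labels, possibly in a different order.
print(sorted(shuffled.index.tolist()))  # [0, 1, 2]

# Each label still points to the same city as before.
print(shuffled.loc[1, 'City Name'])  # San Jose
```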
In [53]:
# Labels absent from the original index (3 and 4 here) produce new rows filled with NaN.
cities.reindex([4, 2, 1, 3, 0])
Out[53]: